Paper 1, Section II, J

Part II, 2020

(a) Let $Z_{1}, \ldots, Z_{n}$ be i.i.d. random elements taking values in a set $\mathcal{Z}$ and let $\mathcal{F}$ be a class of functions $f: \mathcal{Z} \rightarrow \mathbb{R}$ . Define the Rademacher complexity $\mathcal{R}_{n}(\mathcal{F})$ . Write down an inequality relating the Rademacher complexity and

\mathbb{E}\left(\sup _{f \in \mathcal{F}} \frac{1}{n} \sum_{i=1}^{n}\left(f\left(Z_{i}\right)-\mathbb{E} f\left(Z_{i}\right)\right)\right)

State the bounded differences inequality.

(b) Now given i.i.d. input-output pairs $\left(X_{1}, Y_{1}\right), \ldots,\left(X_{n}, Y_{n}\right) \in \mathcal{X} \times\{-1,1\}$ consider performing empirical risk minimisation with misclassification loss over the class $\mathcal{H}$ of classifiers $h: \mathcal{X} \rightarrow\{-1,1\}$ . Denote by $\hat{h}$ the empirical risk minimiser [which you may assume exists]. For any classifier $h$ , let $R(h)$ be its misclassification risk and suppose this is minimised over $\mathcal{H}$ by $h^{*} \in \mathcal{H}$ . Prove that with probability at least $1-\delta$ ,

R(\hat{h})-R\left(h^{*}\right) \leqslant 2 \mathcal{R}_{n}(\mathcal{F})+\sqrt{\frac{2 \log (2 / \delta)}{n}}

for $\delta \in(0,1]$ , where $\mathcal{F}$ is a class of functions $f: \mathcal{X} \times\{-1,1\} \rightarrow\{0,1\}$ related to $\mathcal{H}$ that you should specify.

(c) Let $Z_{i}=\left(X_{i}, Y_{i}\right)$ for $i=1, \ldots, n$ . Define the empirical Rademacher complexity $\hat{\mathcal{R}}\left(\mathcal{F}\left(Z_{1: n}\right)\right)$ . Show that with probability at least $1-\delta$ ,

R(\hat{h})-R\left(h^{*}\right) \leqslant 2 \hat{\mathcal{R}}\left(\mathcal{F}\left(Z_{1: n}\right)\right)+2 \sqrt{\frac{2 \log (3 / \delta)}{n}}

comment

Paper 2, Section II, J

Mathematics of Machine Learning

Part II, 2020

(a) Let $\mathcal{F}$ be a family of functions $f: \mathcal{X} \rightarrow\{0,1\}$ . What does it mean for $x_{1: n} \in \mathcal{X}^{n}$ to be shattered by $\mathcal{F}$ ? Define the shattering coefficient $s(\mathcal{F}, n)$ and the $V C$ dimension $\operatorname{VC}(\mathcal{F})$ of $\mathcal{F}$

Let

\mathcal{A}=\left\{\prod_{j=1}^{d}\left(-\infty, a_{j}\right]: a_{1}, \ldots, a_{d} \in \mathbb{R}\right\}

and set $\mathcal{F}=\left\{\mathbf{1}_{A}: A \in \mathcal{A}\right\}$ . Compute $\operatorname{VC}(\mathcal{F})$ .

(b) State the Sauer-Shelah lemma.

(c) Let $\mathcal{F}_{1}, \ldots, \mathcal{F}_{r}$ be families of functions $f: \mathcal{X} \rightarrow\{0,1\}$ with finite VC dimension $v \geqslant 1$ . Now suppose $x_{1: n}$ is shattered by $\cup_{k=1}^{r} \mathcal{F}_{k}$ . Show that

2^{n} \leqslant r(n+1)^{v} .

Conclude that for $v \geqslant 3$ ,

\mathrm{VC}\left(\cup_{k=1}^{r} \mathcal{F}_{k}\right) \leqslant 4 v \log _{2}(2 v)+2 \log _{2}(r)

[You may use without proof the fact that if $x \leqslant \alpha+\beta \log _{2}(x+1)$ with $\alpha>0$ and $\beta \geqslant 3$ , then $x \leqslant 4 \beta \log _{2}(2 \beta)+2 \alpha$ for $x \geqslant 1$ .]

(d) Now let $\mathcal{B}$ be the collection of subsets of $\mathbb{R}^{p}$ of the form of a product $\prod_{j=1}^{p} A_{j}$ of intervals $A_{j}$ , where exactly $d \in\{1, \ldots, p\}$ of the $A_{j}$ are of the form $\left(-\infty, a_{j}\right]$ for $a_{j} \in \mathbb{R}$ and the remaining $p-d$ intervals are $\mathbb{R}$ . Set $\mathcal{G}=\left\{\mathbf{1}_{B}: B \in \mathcal{B}\right\}$ . Show that when $d \geqslant 3$ ,

\mathrm{VC}(\mathcal{G}) \leqslant 2 d\left[2 \log _{2}(2 d)+\log _{2}(p)\right]

comment

Paper 4, Section II, J

Mathematics of Machine Learning

Part II, 2020

Suppose we have input-output pairs $\left(x_{1}, y_{1}\right), \ldots,\left(x_{n}, y_{n}\right) \in \mathbb{R}^{p} \times\{-1,1\}$ . Consider the empirical risk minimisation problem with hypothesis class

\mathcal{H}=\left\{x \mapsto x^{T} \beta: \beta \in C\right\}

where $C$ is a non-empty closed convex subset of $\mathbb{R}^{p}$ , and logistic loss

\ell(h(x), y)=\log _{2}\left(1+e^{-y h(x)}\right),

for $h \in \mathcal{H}$ and $(x, y) \in \mathbb{R}^{p} \times\{-1,1\}$ .

(i) Show that the objective function $f$ of the optimisation problem is convex.

(ii) Let $\pi_{C}(x)$ denote the projection of $x$ onto $C$ . Describe the procedure of stochastic gradient descent (SGD) for minimisation of $f$ above, giving explicit forms for any gradients used in the algorithm.

(iii) Suppose $\hat{\beta}$ minimises $f(\beta)$ over $\beta \in C$ . Suppose $\max _{i=1, \ldots, n}\left\|x_{i}\right\|_{2} \leqslant M$ and $\sup _{\beta \in C}\|\beta\|_{2} \leqslant R$ . Prove that the output $\bar{\beta}$ of $k$ iterations of the SGD algorithm with some fixed step size $\eta$ (which you should specify), satisfies

\mathbb{E} f(\bar{\beta})-f(\hat{\beta}) \leqslant \frac{2 M R}{\log (2) \sqrt{k}}

(iv) Now suppose that the step size at iteration $s$ is $\eta_{s}>0$ for each $s=1, \ldots, k$ . Show that, writing $\beta_{s}$ for the $s$ th iterate of SGD, we have

\mathbb{E} f(\tilde{\beta})-f(\hat{\beta}) \leqslant \frac{A_{2} M^{2}}{2 A_{1}\{\log (2)\}^{2}}+\frac{2 R^{2}}{A_{1}}

where

\tilde{\beta}=\frac{1}{A_{1}} \sum_{s=1}^{k} \eta_{s} \beta_{s}, \quad A_{1}=\sum_{s=1}^{k} \eta_{s} \quad \text { and } \quad A_{2}=\sum_{s=1}^{k} \eta_{s}^{2}

[You may use standard properties of convex functions and projections onto closed convex sets without proof provided you state them.]

comment

Paper 1, Section II, J

Mathematics of Machine Learning

Part II, 2021

Let $\mathcal{H}$ be a family of functions $h: \mathcal{X} \rightarrow\{0,1\}$ with $|\mathcal{H}| \geqslant 2$ . Define the shattering coefficient $s(\mathcal{H}, n)$ and the $V C$ dimension $\mathrm{VC}(\mathcal{H})$ of $\mathcal{H}$ .

Briefly explain why if $\mathcal{H}^{\prime} \subseteq \mathcal{H}$ and $\left|\mathcal{H}^{\prime}\right| \geqslant 2$ , then $\mathrm{VC}\left(\mathcal{H}^{\prime}\right) \leqslant \mathrm{VC}(\mathcal{H})$ .

Prove that if $\mathcal{F}$ is a vector space of functions $f: \mathcal{X} \rightarrow \mathbb{R}$ with $\mathcal{F}^{\prime} \subseteq \mathcal{F}$ and we define

\mathcal{H}=\left\{\mathbf{1}_{\{u: f(u) \leqslant 0\}}: f \in \mathcal{F}^{\prime}\right\}

then $\operatorname{VC}(\mathcal{H}) \leqslant \operatorname{dim}(\mathcal{F})$ .

Let $\mathcal{A}=\left\{\left\{x:\|x-c\|_{2}^{2} \leqslant r^{2}\right\}: c \in \mathbb{R}^{d}, r \in[0, \infty)\right\}$ be the set of all spheres in $\mathbb{R}^{d}$ . Suppose $\mathcal{H}=\left\{\mathbf{1}_{A}: A \in \mathcal{A}\right\}$ . Show that

\mathrm{VC}(\mathcal{H}) \leqslant d+2

$\left[\right.$ Hint: Consider the class of functions $\mathcal{F}^{\prime}=\left\{f_{c, r}: c \in \mathbb{R}^{d}, r \in[0, \infty)\right\}$ , where

f_{c, r}(x)=\|x\|_{2}^{2}-2 c^{T} x+\|c\|_{2}^{2}-r^{2}

comment

Paper 2, Section II, J

Mathematics of Machine Learning

Part II, 2021

(a) What is meant by the subdifferential $\partial f(x)$ of a convex function $f: \mathbb{R}^{d} \rightarrow \mathbb{R}$ at $x \in \mathbb{R}^{d}$ ? Write down the subdifferential $\partial f(x)$ of the function $f: \mathbb{R} \rightarrow \mathbb{R}$ given by $f(x)=\gamma|x|$ , where $\gamma>0$ .

Show that $x$ minimises $f$ if and only if $0 \in \partial f(x)$ .

What does it mean for a function $f: \mathbb{R}^{d} \rightarrow \mathbb{R}$ to be strictly convex? Show that any minimiser of a strictly convex function must be unique.

(b) Suppose we have input-output pairs $\left(x_{1}, y_{1}\right), \ldots,\left(x_{n}, y_{n}\right) \in\{-1,1\}^{p} \times\{-1,1\}$ with $p \geqslant 2$ . Consider the objective function

f(\beta)=\frac{1}{n} \sum_{i=1}^{n} \exp \left(-y_{i} x_{i}^{T} \beta\right)+\gamma\|\beta\|_{1}

where $\beta=\left(\beta_{1}, \ldots, \beta_{p}\right)^{T}$ and $\gamma>0$ . Assume that $\left(y_{i}\right)_{i=1}^{n} \neq\left(x_{i 1}\right)_{i=1}^{n}$ . Fix $\beta_{2}, \ldots, \beta_{p}$ and define

\kappa_{1}=\sum_{\substack{1 \leqslant i \leqslant n: \\ x_{i 1} \neq y_{i}}} \exp \left(-y_{i} \eta_{i}\right) \quad \text { and } \quad \kappa_{2}=\sum_{i=1}^{n} \exp \left(-y_{i} \eta_{i}\right)

where $\eta_{i}=\sum_{j=2}^{p} x_{i j} \beta_{j}$ for $i=1, \ldots, n$ . Show that if $\left|2 \kappa_{1}-\kappa_{2}\right| \leqslant \gamma$ , then

\operatorname{argmin}_{\beta_{1} \in \mathbb{R}} f\left(\beta_{1}, \beta_{2}, \ldots, \beta_{p}\right)=0 .

[You may use any results from the course without proof, other than those whose proof is asked for directly.]

comment

Paper 4, Section II, J

Mathematics of Machine Learning

Part II, 2021

Let $D=\left(x_{i}, y_{i}\right)_{i=1}^{n}$ be a dataset of $n$ input-output pairs lying in $\mathbb{R}^{p} \times[-M, M]$ for $M \in \mathbb{R}$ . Describe the random-forest algorithm as applied to $D$ using decision trees $\left(\hat{T}^{(b)}\right)_{b=1}^{B}$ to produce a fitted regression function $f_{\mathrm{rf}}$ . [You need not explain in detail the construction of decision trees, but should describe any modifications specific to the random-forest algorithm.]

Briefly explain why for each $x \in \mathbb{R}^{p}$ and $b=1, \ldots, B$ , we have $\hat{T}^{(b)}(x) \in[-M, M]$ .

State the bounded-differences inequality.

Treating $D$ as deterministic, show that with probability at least $1-\delta$ ,

\sup _{x \in \mathbb{R}^{p}}\left|f_{\mathrm{rf}}(x)-\mu(x)\right| \leqslant M \sqrt{\frac{2 \log (1 / \delta)}{B}}+\mathbb{E}\left(\sup _{x \in \mathbb{R}^{p}}\left|f_{\mathrm{rf}}(x)-\mu(x)\right|\right),

where $\mu(x):=\mathbb{E} f_{\mathrm{rf}}(x)$ .

$\left[\right.$ Hint: Treat each $\hat{T}^{(b)}$ as a random variable taking values in an appropriate space $\mathcal{Z}$ (of functions), and consider a function $G$ satisfying

G\left(\hat{T}^{(1)}, \ldots, \hat{T}^{(B)}\right)=\sup _{x \in \mathbb{R}^{p}}\left|f_{\mathrm{rf}}(x)-\mu(x)\right|

comment

Mathematics Tripos Papers

Mathematics Of Machine Learning

Paper 1, Section II, J

Paper 2, Section II, J

Paper 4, Section II, J

Paper 1, Section II, J

Paper 2, Section II, J

Paper 4, Section II, J

Mathematics Of Machine Learning